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ABSTRACT 


We study the problem of the unsupervised learning of graphical models in mixed discrete-continuous 
domains. The problem of unsupervised learning of such models in discrete domains alone is notoriously 
challenging, compounded by the fact that inference is computationally demanding. The situation is generally 
believed to be significantly worse in discrete-continuous domains: estimating the unknown probability 
distribution of given samples is often limited in practice to a handful of parametric forms, and in addition 
to that, computing conditional queries need to carefully handle low-probability regions in safety-critical 
applications. In recent years, the regime of tractable learning has emerged, which attempts to learn a graphical 
model that permits efficient inference. Most of the results in this regime are based on arithmetic circuits, 
for which inference is linear in the size of the obtained circuit. In this work, we show how, with minimal 
modifications, such regimes can be generalized by leveraging efficient density estimation schemes based on 
piecewise polynomial approximations. Our framework is realized on a recent computational abstraction that 
permits efficient inference for a range of queries in the underlying language. Our empirical results show that 
our approach is effective, and allows a study of the trade-off between the granularity of the learned model 
and its predictive power. 
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1. INTRODUCTION 


Probabilistic representations, such as Bayesian and Markov networks, are fundamental to statistical 
machine learning and have applications in a range of domains, including biology, physics and robotics [1]. 
The attractiveness of such networks is that they can express probabilistic dependencies in a compact 
manner. Unfortunately, exact inference in these representations is intractable [2, 3]. Naturally, in the era of 
big data, there is also enormous interest in learning such structures directly from data. However, owing to 
the intractability of inference, learning also becomes challenging, since learning typically uses inference 
as a sub-routine [4], and moreover, even if such a representation is learned, the prediction will suffer 
because inference has to be approximated. 


Tractable learning is a powerful new paradigm that attempts to learn representations that support efficient 
probabilistic querying. Much of the initial work focused on low tree-width models [5], but later, building 
on properties such as local structure [6], data structures such as arithmetic circuits (ACs) emerged. These 
circuit learners can also represent high tree-width models and enable exact inference for a range of queries 
in time polynomial in the circuit size. Sum-product networks (SPNs) [7] are instances of ACs with an elegant 
recursive structure — essentially, an SPN is a weighted sum of products of SPNs, and the base case is a leaf 
node denoting a tractable probability distribution (e.g., a univariate Bernoulli distribution). In so much as 
deep learning models can be understood as graphical models with multiple hidden variables, SPNs can be 
seen as a tractable deep architecture. Standard deep architectures rely on many layers of hidden variables 
for greater representational power, but inference with even a single layer is generally intractable, with 
additional layers exacerbating the problem. Tractable architectures, such as SPNs, have the advantage over 
standard deep learning methods in that they can learn unsupervised and they have full probabilistic 
semantics where inference is guaranteed to be tractable [8]. Of course, learning the architecture of standard 
deep models is very challenging [9], and in contrast, SPNs, by their very design, offer a reliable structure 
learning paradigm. While it is possible to specify SPNs by hand, weight learning is additionally required 
to obtain a probability distribution, but also the specification of SPNs has to obey conditions of completeness 
and decomposability, all of which makes structure learning an obvious choice. Since SPNs were introduced, 
a number of structure learning frameworks have been developed for those and related data structures, 
e.g., [10, 11, 12]. 


In this work, we study the problem of the unsupervised learning of tractable graphical models in mixed 
discrete-continuous domains. In general, tractability and learning issues aside, even in the inference context, 
the situation is generally believed to be significantly harder in discrete-continuous domains: estimating the 
unknown probability distribution of given samples is often limited in practice to a handful of parametric 
forms, and in addition to that, computing conditional queries needs to carefully handle low-probability 
regions in safety-critical applications, such as autonomous vehicles and health monitoring. Techniques 
based on Gaussian, conditional linear Gaussian [13] and mixing exponential families [14] are most 
common, for example. 


To address the problem, then, we would first need a framework where it becomes possible to reason 
about continuous and mixed discrete-continuous event spaces in a general manner. With discrete models, 
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weighted model counting (WMC) has emerged as an assembly language for state-of-the-art reasoning in 
Bayesian networks, factor graphs, probabilistic programs, and probabilistic databases [6, 15]. Exact WMC 
solvers are able to handle low-probability observations, and the query language can support arbitrary 
propositional formulas over logical connectives, including disjunctions, which make them particularly 
useful in the presence of logical soft and hard constraints. However, WMC is limited to discrete finite- 
outcome distributions only, and little was understood about whether a suitable extension exists for continuous 
and discrete-continuous random variables, until recently. The framework of weighted model integration 
(WMI) [16] extended the usual WMC setting by allowing real-valued variables over symbolic weight 
functions, as opposed to purely numeric weights in WMC. These symbolic weights can take the form of 
piecewise polynomials, for example, and thus, WMI appeals to the emerging interest in density estimation 
by piecewise polynomial approximations [17, 18, 19, 20]. Essentially, these approximations can be made 
arbitrarily close to exponential families, among others including log-concave distributions, mixtures of 
Gaussians and Poisson Binomial distributions [21]. WMI additionally enables efficient integration [22] and 
recent WMI solvers [23, 24] are maturing quickly in the sense of attempting to leverage the powerful 
strategies already used in WMC. In essence, WMI is a strict generalization of WMC [16]; for example, just 
as inference in discrete graphical models reduces to WMC [2, 6], obtained from the sum of products of 
atoms’ weights in a propositional model, inference in hybrid graphical models reduces to the WMI, obtained 
by integrating the product of atoms’ densities in a first-order model. For example, a Gaussian distribution 
N(5,1) for a random variable x can be represented using (truncated) models seen in Figure 1. 
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Figure 1. Comparison of piecewise constant vs polynomial using a 7-piece function. 


As a computational strategy, what makes WMI particularly effective is that: (a) the intervals can be 
understood as (and mapped to) propositions [16], known in the literature as propositional abstraction [23], 
allowing us to piggyback on any propositional solver, and (b) WMI admits complex interval queries, such 
as Pr(x > 1:5 | y < 2) where x, y are random variables, yielding a powerful query interface. Notice here 
that the intervals in the queries do not have to correspond to the intervals used for specifying the distribution, 
and can overlap and subsume them arbitrarily. 


To leverage WMI and piecewise polynomials more generally with structure learning, consider that the 
base case for SPNs is a tractable distribution, and so, SPNs are essentially a schema for composing tractable 
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distributions. Although much of the literature focuses on univariate Bernoulli distributions [10, 12], 
continuous models can be considered too, which makes SPNs very appealing because real-world data are 
often hybrid, with discrete, categorical and continuous entities. Nonetheless, this rests on the assumption 
of having a tractable query interface to the underlying continuous and hybrid distributions. In [11], for 
example, the leaf nodes are assumed to be drawn from a Gaussian, and similarly, and [25, 26] assume the 
data are drawn from other families with a known parametric form. 


In this work, we show how, with minimal modifications, structure learning can be generalized by 
leveraging efficient density estimation schemes based on piecewise polynomial approximations. Although 
SPN integration with parametric and non-parametric models is receiving increased attention, to the best of 
our knowledge, we are the first to achieve full integration of WMI and its powerful query mechanism, 
together with tractability guarantees regarding exact integration whilst achieving a clean separation of 
circuit learning from polynomial weight learning. In particular, owing to the tractability of integration over 
polynomial densities [27], inference becomes tractable, and can be computed by doing a few passes over 
the network. Moreover, among other things, the learning component allows us to induce both parametric 
and non-parametric models and their mixtures, including models such as Figure 1 and Figure 2, but with 
no prior knowledge of the true distribution, all of which are further composed over hidden layers in a deep 
probabilistic architecture. In other words, leveraging SPNs enables a fully unsupervised learning regime for 
hybrid data with WMI acting as an underlying inference framework. From a representational viewpoint, 
our framework allows the modeller to study the granularity of the continuous distributions (in terms of the 
number of piecewise components and the degree of the polynomial fit), and determine the empirical trade- 
off between accuracy and model size/interpretability. By lifting the propositional abstraction strategy of 
WMI, the framework is also instantiated in a manner that allows us to use any SPN learner in principle. To 
reiterate, to the best of our knowledge, this is a first attempt to combine WMI and circuit learning in their 
full generality, with an interface for complex interval queries. We will define this precisely in Section 3, 
but briefly the idea is that we can handle conjunctions and disjunctions of probabilistic queries, in the 
sense that these queries involve interval formulas but these intervals are not required to be mentioned in 
the specification of the distributions in an SPN. As a contrasting example, say we have the learning problem 
of credit risk. A deep neural network (DNN) would learn to classify an input customer as being either BAD 
for a customer potentially reneging on paying back a bank load or GOOD if the customer is likely to pay 
back a loan. With standard DNNs inference would be on the classification as Pr(class = BAD | X), where 
X is a vector of customer features (age, job, creditamount, savingsaccount, etc.). Inference 
requires X to have instantiated values but does not allow for more nuanced questions. In contrast with 
SPNs, specifically with our integration of WMI, we can now perform inference such as with Pr(class = 
BAD | job = Unemployed ^ 7,500 < creditamount < 9,000) which allows for performing inference on 
discrete and continuous features with specified ranges with respect to the continuous features. As we will 
see, handling such queries requires multiple passes over the SPN. The code for the framework is available 
on FigShare®. 


© https://doi.org/10.6084/m9.figshare.13012262.v1. 
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Figure 2. The piecewise polynomial approximate algorithm of a mixture of a Gaussian and Beta distribution. 


This article is organized as follows. We discuss related work followed by the formal foundations for our 
work and then turn to our approach that integrates WMI and SPNs. We then discuss our interface for 
complex interval queries. We turn to empirical evaluations after that and finally conclude. 


2. RELATED WORK 


In addition to the related efforts we mentioned above, we remark that in a recent and independent 
effort®, [28] introduces so-called Mixed SPNs (MSPNs). Like our work, they are motivated by piecewise 
approximations for learning from hybrid data. They propose a decomposition and conditioning approach 
for such SPNs employing the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient [29]. While 
close in spirit, there are several significant differences to our work. Firstly, although their proposal is 
motivated by piecewise approximations, they focus on the learning of piecewise constant or linear models. 
From a representational viewpoint, this seems like a shortcoming, because in principle these weight 
functions can be effectively mapped to propositional SPNs with numeric weights, which remove much of 
the complexity, and hence appeal when finer polynomial density approximations are needed®. Secondly, 
the realization of an appropriate query interface that dynamically handles arbitrarily complex interval 
queries is not a focus of the paper — this is a critical feature with regard to our model as we take advantage 
of WMI. From a computational viewpoint, such a realization requires, as we show, subtle decompositions 
and multiple traversals to compute the resulting probability. To reiterate, what we strive for here is a full 
integration of SPNs and WMI. Thirdly, and perhaps most significantly, our approach neatly separates the 
propositional layer from the continuous aspects (including the integration) through pre-processing the data, 
which makes our framework generic, and applicable to any SPN learner. Finally, in our empirical evaluations, 


® A preliminary version of our work was online on July 2018: https://arxiv.org/abs/1807.05464. 

® It is possible that learning polynomials of arbitrary degree over intervals of arbitrary granularity are compatible with their 
approach, but without a discussion on the trade-off between accuracy and representation, and an algorithm for learning 
such polynomials, it is hard to assess such potential extensions. 
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we show that our approach significantly outperforms [28] on almost all data sets in terms of the average 
test log-likelihood measure. On the other hand, it is argued in [28] that the decomposition technique 
naturally handles random variables of unknown type, whereas we treat all real-valued variables via 
intervalization and piecewise polynomial approximations. So it may be fruitful to combine the strengths of 
both Renyi decomposition of variables of unknown type with our handling of continuous variables of 
unknown distribution, which we leave for future research. 


Regarding inference and MSPNs, a later paper [30] expands MSPNs to include an inference specifically 
geared for Approximate Query Processing (AQP). AQP aims to deliver an estimate of the query given a 
fixed time-bound. Kulessa et al. proposed two strategies for querying, and one based on likelihood inference 
on the model and the other based on sampling. The differences between [30] and our approach are notably 
similar to the differences our approach had with [28], that is, treating linear piecewise approximations as 
constants, and not allowing user-defined subinterval queries. Nonetheless, given the focus on SQL queries, 
the nature of querying is far broader than our approach. We also note the inclusion of sampling is a unique 
approach which holds merit given the broad set of user-defined SQL queries allowed, and we observe that 
the inclusion of our methods of complex querying would be a constructive addition to their SQL syntax. 
We elaborate further in Section 5. 


Another body of work released after our initial findings also addresses inference concerning MSPNs [31]. 
Both our work and [31] use a similar separation of structure learning with distribution modeling at the leaf 
nodes. [31] relies on MSPNs to derive the structure of the data and then use Gibbs sampling in conjunction 
with a likelihood dictionary (e.g., Gaussian, Gamma, Exponential distributions) to approximate variable 
distributions at the leaf nodes (parameters for parametric models). This approach replaces the linear 
piecewise polynomials from [28] and uses a Bayesian inference approach with a Gibbs sampling scheme. 
They can perform interval querying on continuous variables that are similar to our approach (complex 
querying), as well as learn on missing data which we do not handle. Our approach, however, is not reliant 
on a likelihood dictionary of parametric distributions nor sampling for parametric parameters. As will be 
explained in Section 4, our model is more aligned with the general tractable inference mechanism of SPNs 
with distribution modeling handled by piecewise polynomial learning on the data. It would be interesting 
to combine our piecewise polynomial learning modeling method with the likelihood dictionary and explore 
a sampling approach in performing inference for future endeavors. Nonetheless, we note that we perform 
exact inference in our current framework, which we think fits well with the tractable learning paradigm of 
learning structures that permit efficient exact inference. Leveraging a sampling-based method would have 
to be approached carefully, and the merits and demerits evaluated thoroughly. 


Outside of the above lines of research, research in learning in hybrid domains has been primarily limited 
to well-known parametric families. For example, [13] focuses on conditional linear Gaussian models, 
and [14] focuses on mixing exponential families. This is perhaps not surprising, because inference in hybrid 
domains has been primarily limited to approximate computations, e.g., [32], and/or Gaussian models [33]. 
In a similar spirit, approaches such as [34, 35] consider learning of complex symbolic constraints by 
assuming base distributions as being either Gaussian or a softmax equality. Another inference direction for 
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hybrid domains was done by [36], where SPNs interchange integrals with sum nodes to evaluate on 
continuous features that are provided with arbitrary input distributions. While that is a novel approach, we 
maintain an integration scheme at the leaf nodes to handle arbitrary interval queries. 


A particular body work which utilizes parametric families for modelling hybrid distributions is Automatic 
Bayesian Density Analysis (ABDA) [31], which uses the SPN architecture to cluster feature distributions for 
mixture modelling, and SPN inference to impute missing values and for handling outlying values in the 
data. ABDA forgoes the piecewise linear approach of MSPNs and our piecewise polynomial method, 
as such we include comparisons of ABDA on hybrid data sets with our method in the experiments in 
Section 6. We also include experimental comparisons with Bayesian Sum-Product networks (BSPNs) which 
applies a Bayesian approach to structure learning by focusing on deriving so-called scope functions to 
formulate joint Bayesian functions over the structure and parameters of a SPN [37]. From our empirical 
evaluation we find our approach succeeds in beating both ABDA and BSPNs. 


While our research focuses on structure learning, we mention an interesting framework which does away 
with SPN structure learning by constructing random region graphs populated with SPN nodes, so called 
random and tensorized SPNs (RAT-SPNs) [38]. A key aspect of RAT-SPNs is that by training the model 
discriminatively, it yields a classier on par with deep architectures. As the learning objective of a RAT-SPN 
can be supervised as opposed to unsupervised, a future direction of research could be the reframing of our 
learning method to be supervised by incorporating the discriminative training framework of RAT-SPNs. 
Another future research direction would be investigation into Einsum Networks (EiNet), which computes 
SPN sum and product nodes on the same topological layer via a single monolithic einsum operation [39]. 
Incorporating EiNet operations could lead to scaling up of large data set modelling with respect to our method. 


From an inference viewpoint, WMI serves as a computational abstraction for exact and approximate 
inference [16, 40] based on piecewise polynomial approximations. It is thus different in spirit to variational 
methods and analytic closed-form solutions for parametric families. We refer readers to [16] for discussions. 
WMI has enjoyed considerable advances in recent years [41, 42], and in that sense, our framework might 
be the starting point for closing the gap between inference and learning. 


3. INFERENCE FRAMEWORK AND GRAPH ARCHITECTURE 


This section presents the core components of our methodology by explaining WMI and SPNs. Emphasis 
is given to WMI's relationship with complex queries, including a formal definition of complex queries. We 
also provide an illustration of the probabilistic modelling capabilities of SPNs, as well as basic notation 
used throughout. 


3.1 Weighted Model Integration 


Given a propositional formula A, the task of model counting is to compute the total number of satisfying 
assignments for A. Given an assignment (model or world) M, we write M F A to denote satisfaction. For 
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example, p v q has 3 models, and a satisfying ratio of 3/4. Weighted model counting (WMC) is its extension 
that additionally accords weights to the models [6]. Models are usually accorded weights by computing 
the product of weighted literals: that is, assuming w maps literals in A to positive reals, we define: 


WMC(A,w) = X [ Jw) (1) 
MtAleM 
where the sum ranges over propositional models, and / e M denotes the literals true at the model M. For 
example, suppose p and q are accorded a weight of .6 and .3, respectively, with the understanding that 
the weights of a negated atom —a = 1 — w(a). Then, the weight accorded to [p = 1, q = 1] would be .18, 
and WMC(p v q,w) = .18 + .42 + .12 = .72. 


WMC is a state-of-the-art exact inference scheme, and owing to its generality, inference in a number of 
formalisms, such as relational Bayesian networks and probabilistic programs [15], are encoded as WMC 
tasks. Given a formula A, a weight function w, a query q and evidence e, probabilities are computed by 
means of the expression: Pr(q | e, A) = WMC(A a q a e, w)/WMC(A a e, w). 


Due to the propositional setting, however, WMC is limited to finite domain discrete random variables, which 
have spurred considerable interest in generalizing WMC to continuous and hybrid distributions [16, 18, 43]. 
Weighted model integration (WMI) is a computational abstraction for computing probabilities with 
continuous and mixed discrete-continuous distributions [16]. The key idea is to additionally consider 
inequality atoms, such as 0 < x < 10, along with possibly non-numeric weights, such as x’. The intuition is 
to let the inequality atom denote an interval of the domain of a density function, and let the weight define 
the function for that interval®. Formally, WMC was based on weighted propositional knowledge bases, and 
leverages SAT (propositional satisfiability) technology, and by extension, WMI is based on weighted linear 
arithmetic theories and leverages satisfiability modulo theory (SMT) technology [44]. 


Given such a knowledge base, the weighted model integration (WMI) is obtained from the model count 
of the theory together with a volume computation for the intervals. Formally: 


WMI(A,w)= yf [Io (2) 
M 


= le 
Mra (tle) 


where A- denotes a propositional abstraction, based on a mapping from linear arithmetic expressions to 
propositional atoms, and A* denotes a refinement, where the atoms are mapped back to their linear 
arithmetic expressions. For example, (0 < x < 10) v q on abstraction yields p v q, where p is a fresh atom 
chosen so as not to conflict with any of the existing atoms in the theory, which has a model count of 3 on 
abstraction. 


® Oblique supports such as (x > y) or (x — y > 5) may also be considered [45], but since we will be interested in weighted 
mixtures of univariate distributions, we will direct our attention to intervals here. Intervals can also be used to define many 
classes of multivariate distributions via hyper cubes [45]. These are intervals extended to multiple dimensions. 
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Example 1 [16] 

Suppose A is the following formula: (0 < x < 10) v q, where p is the propositional abstraction for (0 < x 
< 10). We also suppose that the weights are given with w(p) = x? and w(q) = 0.3, and the weights for 
negated atoms are w(-p) = 1 and w(-q) = 0, respectively. There are three models for A: 


1. M = {p,q} : Since wng) = 0, by definition we have 
x? 
—p*_ ayo 
foao P -w(=~g) =| 3 0h =0 
2. M=fpq}: f- „1 0-3dx =10.3x1? =3 
3. M={pq}: f X? 0-3dx = 10.11? = 100 
Thus WMI(A,w) = 100 + 3 + 0 = 103. 


To understand the intuition here, consider that in the one-variable setting, we are assuming a density 
function f : R > R of the following form [45]: 


a, tax+ +a xX forxeA,i=1,,k 
f(x) -| Oi li ni i (3) 


0 otherwise 


where Aj, ..., A, are disjoint intervals in R that do not depend on x and a, ..., a,; are constants for all i. 
We will say that f is a k-piece n-degree piecewise polynomial density approximation. 


This is then encoded as a WMI atom A, (e.g., x € [«, p1) with weight w(A)) = ao; + ... + a,x". For example, 
suppose we also have the query atom « < x < f and e = true. (We treat the weights of query atoms and 
their negations as being 1.) On abstraction, we would use a proposition p to stand for the interval A;. Clearly, 
calculating the probability of the interval simply amounts to calculating the area under the curve given by 
the polynomial w(A). 


Prix € [a BI) = [ayy +--+ ayx")dx (4) 
This can also be approached as: 
Pr(p) = Jode = i” pêo +++ a,;X”)dx (5) 


Naturally, if the query refers to intervals not appearing in the specification, we have to decompose the 
problem. We will return to this point in more detail later (Section 5), but the rough idea is that given two 
pieces of the density function f, say A; = % < x < fı and A, = pı < x < fa with weights P,(x) and P(x), 
clearly the probability of x e [a*, p*] where a, < a* < f, and B, < B* < B, would be obtained as: 


Pr(x e la", B'}) = J sPxidx + i, P,(x)dx (6) 
* 1 
While the mathematical treatment is immediate, computing these probabilities dynamically requires, as 


we will see, multiple passes over the SPN and so we will refer to such queries as complex queries. 
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Definition 1 Complex Queries: Given a k-piece n-degree piecewise polynomial density function f for r.v. 
x, defined over intervals A;, ..., Ay, we define a complex query as an interval query q:a* < x < B* where: 


1). a* or p* is not one of the extreme points in Aj, ..., Ag. 
2). given two (not necessarily complex) queries q and q’, we are interested in the probability of q o q’, 
where o €{a, v} or the probability of ~q. 


In general, the WMI apparatus is very flexible and on top of complex querying with continuous variables, 
we can also combine complex querying with Boolean variables. 


3.1.1 Weighted Model Learning 


Weight learning in the past [16] has focused on piecewise-constant density approximations where 
features are linear arithmetic sentences and weights are constants. This was addressed as follows: given a 
formula f, we wish to ensure that E,[f] = E,[f] by maximum likelihood estimation (MLE) with weights 
w = (wi, ..., w,) and data D. This states that the empirical expectation of f, in D (for example the count) 
equals its expectation according to the current parameterization. 


More precisely given an SMT theory A, the empirical expectation of a formula « € A can be obtained 
by calculating: 


E,[«] = size(d | da and d € D) / size(D) (7) 


This simply counts the items in D which are true for «. Given a weight function w, the expectation of 
a e A wrt w is given by: 


E,, [a] = WMI(a,w)/Z (8) 


Here Z is a normalization constant, also referred to as the partition function. In our case, the partition 
function equals the integral of weights of all models of A, thus Z = WMI(A,w). Based on this formulation, 
standard iterative methods for convex optimization can be used for weight learning [16]. In our setting 
integration with circuit learning removes the burden of parameter learning from WMI, and moreover, we 
are able to handle piecewise polynomials density approximations. 


3.2 Sum-Product Networks 


SPNs are rooted acyclic graphs whose internal nodes are sums and products, and leaf nodes are tractable 
distributions, such as Bernoulli and Gaussian distributions [7, 10]. More precisely, letting the scope of an 
SPN be the set of variables appearing in it, a recursive definition is given as follows: 


(1) a tractable univariate distribution is an SPN (the base case); (2) a product of SPNs with disjoint scopes 
is an SPN (denoting a factorisation over independent distributions); and (3) a weighted sum of SPNs with 
the same scope, where the weights are positive, is an SPN (denoting a mixture of distributions). 
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Formally, SPNs encode a function for every node that maps random variables X,, ..., X, to a (numeric) 
output. For node j, the corresponding function f, maps a world X, = x, ..., Xa = X, to its probability if it is 
a leaf node, the weighted sum Di Mill ines) for sum nodes where fj are the children of node j, and 
Ve ogee, for product nodes. Conditional probabilities are expressed using: 


x.) (9) 


m+ileersn 


PX raara [Meg spenep Xi) =F OG see) XA BX 


where 0 is the root node, and the notation f with only a subset of the arguments used in the denominator 
denotes marginalisation over the remaining arguments. The benefit of SPNs here is that a bottom-up pass 
allows us to compute conditional queries in time polynomial in the circuit size. (See, for example, [46] on 
other data structures that allow a more expressive class of queries.) 


3.3 Notation 


During our discussion, we use the following notation. Random variables are denoted using X or Y as 
well as X;. The set of possible values for a variable (or feature) X, are implicitly understood to be binned as 
intervals {X Xo +++, Xx}. Abusing notation, we treat x}, and x° to true and false values accorded to the 
Boolean variable serving as a propositional abstraction for the interval x„. We periodically generalize x,; 
with x; for a random variable X, where x; represents any given interval <k. We denote a data set as D, with 
T being the set of instances, and V being the set of variables. D’ represents a binarized data set, that is, 
after binning the points in D as intervals and interpreting those intervals as Boolean variables. We refer to 
queries as q and we also refer to log-likelihood function as £. For WMISPNs and SPNs, we use f, to refer 
to a general node j with f referring to the root node and fj referring to a node i with parent node j. 


4. TRACTABLE MODEL STRUCTURE LEARNING 


In this section we present our proposed algorithm which derives the structure and parameters for SPNs 
with piecewise polynomial leaves, equipped with WMI (WMISPNs). The method LearnWMISPN is capable 
of deriving such an SPN structure from hybrid data and further, it allows for the retention of the distribution 
parameters in the continuous case for use in advanced querying. 


The structure learning approach for LearnWMISPN is derived from LearnSPN [10]. LearnSPN is a recursive 
top-down learning method which is capable of learning the structure and weights for SPNs by identifying 
mutually independent variables and clustering similar instances. LearnSPN takes a simple approach to 
structure learning by recursively splitting data into a product of SPNs over independent variables, or a sum 
of SPNs comprising subsets of instances. More generally, LearnSPN represents an algorithm schema which 
allows for a modular approach in structure learning from differing domains, which is why our approach 
for learning changes only the base case. 


The primary extension with LearnWMISPN is the addition of generated polynomial leaf nodes. Given 
a data set with discrete, categorical and continuous attributes, LearnWMISPN learns polynomial weight 
functions as density approximations for the continuous cases, mapped to leaf nodes. By doing so, 
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LearnWMISPN provides a model which is capable of complex querying in the continuous case, while also 
conditioning over other variable types for hybrid domains. 


Prior to structure learning, a preprocessing step is performed which first transforms discrete, categorical, 
and continuous features into a binary representation. Much like the algorithm schema of LearnSPN, the 
preprocessing step can implement any number of unsupervised binning methods including using a mean 
split, equal frequency binning, or equal width binning [47]. This is followed by a polynomial learning 
function which identifies the polynomial function coefficients and probability densities for the continuous 
features in the hybrid domain without prior knowledge of the true density function. 


Algorithm 1 describes the recursive structure learning algorithm for LearnWMISPN and Figure 3 illustrates 
the recursive pipeline. LearnWMISPN takes as input the preprocessed binary data set D’ with T instances 
and V’ variables and returns a WMISPN representing the distributions of the hybrid domains. LearnWMISPN 
recurses on subsets of V’ until a vector of unit length is found, at which point a corresponding univariate 
distribution is returned. 


recurse 
When |V| = 1 and A 
continuous, return 7 f 
smoothed polynomial «| fi 
a distribution i 
If variable independence “ | 
exists a e2] j i ~ 
i ‘N 
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Instance: Training set 


When |V| = Land «¿sos 
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Figure 3. A recursive algorithm for learning hybrid domains. 
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Algorithm 1. LearnWMISPN (T, V’). 


Input : T: set of instances, V’: set of variables, 6: parameters (cf. Algorithm 2) 
Output: SPN representing a distribution over V’ learned from T 
1 if |V’| = 1 then 


2 if variable V’ is an instance of a continuous feature then 

3 | return univariate distribution estimated 
from polynomial probability densities 

4 else 

5 | return univariate distribution estimated 
from binary counts in T 

6 end 

7 else 

8 partition V’ into approximately independent subsets V’ 

9 if success then 

10 | return [] ;LearnWMISPN(7, V’); 

11 else 

12 partition T into subsets of similar instances T; 
return X; 71 LearnWMISPN(7;, V’) 

13 end 

14 end 


The main novelty in LearnWMISPN is the handling of the base case. Without any prior knowledge, 
LearnWMISPN will return a smoothed univariate distribution for discrete and categorical variables, or 
piecewise polynomial distribution for continuous variables. These two outcomes set LearnWMISPN apart 
from LearnSPN, and this is represented with polynomial leaf nodes corresponding to continuous distributions. 


If the base case has not been reached, then LearnWMISPN continues the recursive structure learning of 
LearnSPN. For the decomposition step, a variable split is identified which results in mutually independent 
subsets, generating a product node for the resulting subset. Mutual independence for the variable splits is 
determined by a G-test for pairwise independence: 


L , i yj 
G(x, X) = 25 Fe dlog 27 


mad (10) 
i=0 j=0 C(x; )e(x}) 


where the summations range over the elements of each variable interval x, and x, c(-) refers to the occurrence 
count for a setting of a variable pair or singleton, and T refers to the number of instances [10]. 


As an example, say we are checking the independence for two continuous features X, = x, and X, = x, 
(we ignore binning to focus on the test itself). As there are only four possible combinations from the count 
function c(xi,x}), we derive a 4x4 matrix which list the counts for c(x?,x3), c(x?,x3), C(xq,x9), and c(x},x3). 
We also calculate a count vector for x, and x,, respectively that counts the number activations (1) and 
negations (0) in each feature separately (represented by c(xi) and c(xj) in the equation). The empty sets 
Ø in the count vectors are then counted (it is only ever possible to have 1 empty set or 0 empty sets). From 
there we sum over the possible activation/negation cases using the G-test algorithm, and the final G-test 
value must be less than the threshold value calculated from features x, and x, to be classified as independent. 
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The threshold value is calculated as (2 x DOF x gfactor + 0.001). We can calculate the degrees of freedom 
(DOF) based on the number of states a feature can be represented by (2 in our case as it is either O or 1), 
and the empty set count for each feature 2-2, -1)x 2- Ø, — 1). The gfactor is chosen as 0.001 and 
we add 0.001 to prevent a threshold of zero. 


It is also noted that we perform the G-test on two features, say x,; and xp, derived from a continuous 
feature X; where a and b refer to a bin interval such that 1 < (a, b) < k given {Xa Xe} C X;. If Xa; has a large 
number of activations and x,; has a small number of negations (mostly O's with say a single activation 1), 
the G-test will still be able to determine that the features are dependent. We remark that the G-test is not 
a perfect metric for determining independence, but generally, we found it very successful when provided 
with binarized data. 


For the conditioning step, similar instances are clustered using hard incremental expected-maximization 
(EM) generating a sum node. The ratio split on the clustered subset of instances determines the corresponding 
weight and what is returned is the sum of the resulting weighted clusters. We condition initially on a slice 
S c D, which is a data structure composed of a subset instances (Ts c T) and a subset of features (V; c V) 
from the initial data set D. Clusters © are derived from slice S via 10 restarts through the subset of instances 
T; four times in random order. A new cluster C is added to C if the log-likelihood is greater than the cluster 


penalty likelihood -ø |V |+% 


1 . : 
y Nae where ø is the cluster penalty and F refers to the attribute count 
ve S + A 


for each feature (2 in our model). In order for a set of clusters to remain in the WMISPN structure and for 
learning to continue, the best set of clusters C is greedily chosen based on log-likelihood £ improvements: 
score= X g(L(C | S,), CP" +L(C |S) Y CEC (11) 
tel 
where C?"°" refers to the ratio of local instances Te to slice instances Ts, S, refers to the partition of data 
based on instance t, and C- refers to the previous cluster in the recursive formula g(-,-). The function g(-,-) 
is used to calculate the maximum for the log-domain: 


xt+In(l+e’™) if x>y 


g(x,y) -| (12) 


y +In(1+ e”) otherwise 


LearnSPN assumes a naive Bayes mixture model, where all variables are independent conditioned on the 
cluster; however LearnWMISPN applies a prior: 


Pr(V’) = X P(C) [ J Pr(X; | C)LPr(X) (13) 
i j 


where G; is the ith cluster and X; is the jth variable, but the prior Pr(X) is only applied when the feature is 
continuous. 


Overall, we upgrade both learning and querying over the models to handle piecewise polynomial density 
approximations. For the learning part, since only the base case is changed for our method, it suffices to 
show that the learning of the univariate distributions can be effective, which we discuss below. The querying 
is treated in the subsequent section. 
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Formally, let X; = (X ..., Xu) where k is the number of intervals for every continuous feature X, The 
polynomial learner provides to LearnWMISPN 4, a probability constant representing priors for the 
continuous interval x,, a being an arbitrary interval <k. In the base case of LearnWMISPN, a univariate 
distribution is returned but calculation is contingent on whether the feature is continuous or discrete. 
Given a continuous feature interval x,,, we wish to find the calculation of the unnormalized probability 
of a WMISPN #?(x,,). Here p represents the parent node to the child leaf / node f? which maps to an 
interval x,,;. 


The leaf likelihood f? (xx) for an interval x,; is calculated based on activations and negations of instances 
for the interval. For activations, we find the count activtions c}, = c(x!,), and negations c®, =c(x°). The 
likelihood for x,; is then calculated as: 


cA 


ai Xaj 


~ Gh Fe ) 


ai’ Xaj 


AP (Xa) (14) 
This calculation considers the one-hot encoded representation of x,; as negations correspond to the 
probability of (Xi; -<< Xai Xeni +++ Xx) being activated instead. 


Preprocessing Step: A preprocessing step is required to first transform the discrete, categorical, and 
continuous variables into a binary representation. The methods for converting a data set into a binary 
representation is contingent on the variable type in the hybrid data set. The preprocessing pipeline is 
summarized in Algorithm 2. 


Algorithm 2. Preprocessing step(D, f, k). 


Input : D: dataset of real values, f: vector of domain types, k: number of bins 
Output: D’: binary representation of D, M: matrix of coefficients for the polynomials, 
p:vector of probability densities 
D’ —EqualWidthBinning(D, f, k) 
M, p —PolynomialLearner(D, f, k) 


The first sub-task converts the real-valued data set D into a binary representation D’. In our model, 
equal-width binning was performed on continuous features, with the number of equal-width bins specified 
as a parameter. (A meta-learning step can be designed to choose the optimal number by studying log- 
likelihood or polynomial improvements.) Higher bin counts would increase the representative power and 
improve accuracy for complex queries on continuous distributions (see section 5). Each new bin generated 
from a feature represents a discrete range on that continuous distribution. A one-hot encoding transformation 
was performed on the expanded feature set to get a binary representation. For discrete and categorical 
variables, the binary representation is achieved again with a one-hot encoded transformation with equal 
width binning not taken into consideration. 


As an example for one-hot encoding in the discrete case, consider a random variable Y which has a 
domain of boolean values. Booleans result in two classification labels so Y is expanded to (y,, y2), which 
now correspond to each boolean instance. Formally: 
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=1if 
vozi” i (15) 
y, =1if =x 


The two cases for Y are then represented as [x] > [1, 0] and [>x] — [0, 1], respectively. Now consider the 
continuous case, where equal width binning is implemented, we say that random variable X is defined as 
having a range [a, b] > x e R :a <x <b. If we split X into say three bins (x,, x2, x3), with the split point 
defined as s = (b — a)/3 then we again have another representation for one hot encoding. Formally: 


X,=1if a<x<(ats) 
X(x) = 4X, =1if (a+s) <x <(a+ 2s) (16) 
x =1if (a+ 2s)<x 


So for the minimum and maximum values in X we get the following one-hot encoded representation 
[a] > [1, 0, O] and [b] = [0, 0, 1]. This means the binary transformation of an instance D,. = [b, x] is 
Dj, =[0,0,1,1,0]. 


It is worth noting that increasing bin intervals leads to overfitting of the data and spurious likelihood 
values. Consider a data set D with M instances and a N-sized feature space, and consider the effect of 


increasing interval counts k (where k > M) on a given continuous random variable X; where Yc) <M 
and Fi cx) <(k-M). Binning results in an increasing feature space N, which leads to continuous 
intervals x; frequently being O. This implies that prior likelihoods of activations decreases, while prior 
likelihoods for deactivations increases, which can also be seen from the one-hot encoded representation 
that leads to additional features with no values. Such problems are to be expected with overfitting; however, 
as we will see in the following section, we use an information criterion to regularize bin counts, choose 
an optimal number of bins and temper the piecewise polynomial function degree in the model. This is 
further evidenced by our empirical evaluations (cf. Table 2 and 3), which shows that our results naturally 
converge to < 5 bins per continuous feature. 


Polynomial Learner: The second task learns a piecewise polynomial (PP) approximation for any univariate 
probability density function (PDF) without prior knowledge or simplifying assumptions. The original data 
set D, the indices for the continuous variables, and the number of bins k are taken as input, and as output, 
we receive a vector of probability densities for each continuous feature bin. This bin information is retained 
in its respective polynomial leaf node in the SPN for use in complex querying. 


The piecewise polynomial functions are calculated through a linear combination of b-splines as described 
in [48]. B-splines are polynomial curves that are right-side continuous, differentiable, and they form a basis 
in the space of piecewise polynomials. This follows that linear combinations of b-splines can represent any 
piecewise polynomial function®. 


® Note that we may place additional constraints on the learning regime. For example, [17] show that they can find a candidate 
hypothesis f € P,,,(/) such that ||f — g||, g being the unknown true density, can be made arbitrarily small. We chose the 
basis-spline technique, but none of our technical results hinge on opting for either of these methods. 
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B-spline interpolation can then be used to find an approximation of the density function as a linear 
combination of (Z = k + r — 1) b-splines, where r defines the degree order spanning the whole domain. 
Thus, for learning a given b-spline, only the order-r which is equivalent to r = degree — n + 1, the number 


of intervals defined by a binning method-k, and the knot sequence (ô = (ap, ..., ap) which refers to where 
pieces of the polynomial intersect are required. We define the j” b-spline B;(x), where j = 1, ..., Z as 
follows: 
"(Arup — X) Aaya — X) 
B! = aay H Lea j-r+t j-r+t 17 
we a a " 4-4 Wj -e(Ay rat) i i 


where w;,(x) is the first derivative of w,_,(x) = IIx- a_,,,) and H(x) is the Heaviside function: 


7 J x20 18) 
w O x<0 ( 


Our objective now is to learn piecewise polynomial (PP) weights for each of the intervals. So suppose 
g : I > R, is the density of an unknown distribution over /, where / is an interval. We would like to find 
a candidate PP hypothesis f. We say f is a k-piecewise, degree-n polynomial(also called mixture of 
polynomials (MOP)) if there is a partition of / into k disjoint intervals (/,, ..., I) such that f(x) = f(x)Vx e l, 
and each (f, ..., fẹ) is a polynomial of degree <n. Let P;,,,(/) be the class of k-piecewise degree-n polynomials 
over /. We would now like to draw m < |D] samples that finds m candidate hypothesis fi, ..., fn € Perl, 
and outputs f* € (fi, ..., fm) that maximizes the likelihood of observing the samples. Note that k is fixed by 
the intervalisation step, and so we only need to iterate over the maximum degree of the polynomial and 
that we are only learning univariate distributions for the leaf node. 


The order, and, if not previously specified, the number of bins are chosen during training time by means 
of the Bayesian Information Criterion (BIC) [49]. 


_ log(|D)) 


BIC(x,f(x)) = L(x | F(x) J 


(19) 


Where £ denotes the log-likelihood. Use of BIC scores and b-spline interpolation follows the usual 
properties of probability density functions, such as guaranteeing that MOP approximations remain 
continuous, integrate to one, are non-negative, and free of prior assumptions [50]. 


The BIC scoring function has been shown to be robust and was chosen to avoid overfitting the model 
parameters. In addition, the chosen method favours smaller polynomial ranks [50] which improves the 
efficiency of the integration without losing accuracy. We reiterate that we do not make any assumptions 
about the true density®. 


®© Itis interesting to observe that B-splines have been influential in a number of disciplines, including physics, engineering, and 
computer vision, and our algorithm contributes, for the first time, “deep B-splines”. For the future, it would be interesting 
to study how LearnWMISPN would be applied in these domains, for modelling latent dependencies in computer graphics, 
for example. 
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5. PROBABILISTIC MODEL QUERYING 


This section is devoted to expanding out definition of querying with respect to WMISPNs. We first detail 
the means by which SPNs perform tractable inference, then we illustrate our integration of complex querying 
into the WMISPN framework. 


SPNs provide tractable querying on univariate distributions and can calculate the marginal and conditional 
probabilities on learned features. Other schemas for SPNs limit the scope of querying to binary activations 
Pr(X; = 1) to the leaf nodes [7]. In WMISPNs, we are able to expand our queries to complex cases where 
leaf nodes maps X; to new probability ranges (a < x < b), a, be R. 


5.1 SPN Inference 


General SPN inference is initiated at the leaf nodes where queries are presented in the form of selected 
activations of the random variables q(X, = 1, X = 0, ..., Xn = 1). The mapped probabilities x; propagate 
upward from the leaf nodes where values from children nodes are either multiplied at product nodes, or 
the weighted sum is taken at sum nodes. The final likelihood of the queries is returned at the root node 0. 


To normalize the returned likelihood, a partition function is required to calculate the normalization 
constant. Formally: 


Z =$ h(x) (20) 


Inference is deemed tractable due to the partition function being computational in time linear to the 
number of edges in the learned SPN, which in turn results in queries that can be calculated in time linear 
Pr(q) = fo(X1, 1- X, ..., X,)/Z. The same is true for conditional likelihood estimates. 


5.2 Complex Querying 


We extend the SPN inference model to allow complex queries where inference can be performed over 
continuous as well as discrete features, and combinations thereof: for example, queries on discrete features 
such as male(X) conditioned on continuous ranges such as 40 < weight(X) < 50. The advantage of complex 
queries, with their capability of performing inference on conjunctions of discrete and continuous feature 
ranges, allows modellers added ability in assessing the relationship of probability distributions in data. 
Taking our example of credit risk, complex querying can provide means to model the decision boundary, 
provided a trained model and a mixed discrete continuous data set. An exhaustive querying of a model 
with complex queries can determine what feature values would lead to a classification change. Taking 
the credit risk example again, complex queries of the form Pr(class = BAD|O < creditamount < k) and 
Pr(class = GOOD|k < creditamount < 10,000) where k indicates a real value boundary for the feature 
creditamount, can inform the modeller where the likelihood of classification changes. In this example we 
focus on one feature but complex queries allow for multiple conjunctions as will be explained further. 
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First, observe that LearnWMISPN is given probability constants as well as polynomial representations for 
continuous intervals based on the BIC score, thus standard univariate boolean queries do not require 
integration of new probability constants. In particular, suppose L(S PN) refers to the propositional abstraction 
of the SPN (that is, think of intervals as propositions and only refer to the probability constants), and let 
PP(S PN) refer to the continuous representation. We can then describe a query q € L(S PN) to mean a 
Boolean query defined as q(x, = b,,x,=1-,,...,x, = ,,x, =1- b,) where b; € {0, 1}. 


nln n 


Proposition 1 Suppose T is a learned WMISPN. Then the partition function of T for any q e L(T), the 
probability of q can be computed in time linear to the number of edges in T. 


Our system augments the above-described machinery for querying to allow for complex queries. This 
adds the ability to ask queries over arbitrary continuous ranges which in turn enables a mixed querying 
setup between continuous and discrete features in the same formulas. 


Standard SPN inference is initiated at the leaf nodes where queries are presented in the form of selected 
activations of the random variables (i.e., a state). The mapped probabilities propagate upward from the leaf 
nodes where values from children nodes are either multiplied at product nodes, or the weighted sum is 
taken at sum nodes. The final likelihood of the queries is returned at the root node. 


Inference for continuous attributes is based on WMI. Let us explain the intuition using a univariate 
distribution. Our structure learning algorithm above assigns ranges r; = [«,, pi] and their piecewise polynomial 
density approximation px) to a continuous SPN leaf node i. With that, we can perform inference at the 
leaf node itself through WMI. For example, assume a continuous leaf node for the attribute “weight” has 
the range 34 < weight(X) < 55 and suppose its calculated polynomial is -0.051+0.0016x. The probability 
for the query 40 < weight(X) < 50 can then be calculated using Ñ- 0.051+ 0.001 6x dx = 0.21. Once the 
leaf has processed the query its value is passed to the SPN where it is applied to the other variables of the 
query. We reiterate that inference using the polynomials is not performed in the SPN itself. Only the value 
is passed on to be interpreted. This demonstrates the built-in modularity of our approach. 


Naturally, posing a query such as (40 < weight(X) < 60), spanning over multiple intervals is less trivial. 
In this case, assume a second interval for weight(X) is specified 55 < weight(X) < 77 with a polynomial 
density of —0.1469-0.0019x. Then, the probability of the query is calculated as Pr(40 < weight(X) < 70) = 
Pr(40 < weight(X) < 55) + Pr(55 < weight(X) < 70) = ine ~0.051+ 0.0016x dx + fo. 469 — 0.0019x dx = 
0.375 +0.291= 0.666. In general, as the ranges were defined to be mutually exclusive, the calculation of 
the probability boils down to: 


1). dividing the range into m—k subranges where each corresponds to a leaf node i, i e {1, ..., n}, 
1<k<m<n; 

2). performing inference through WMI in each leaf with the polynomial density approximation p;x); 
and then 

3). adding the calculated probability values to obtain the final value. 
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That is, 
b B S ph b 
Í p(x)dx = J p,(x)dx + > Í P(x) +f Pm (X)dx (21) 
jaket “j %m 
By extension, if a query consists of multiple continuous features X,, ..., Xy, we find that we require a 


joint density approximation over the variables in order to infer the likelihood. We recall that given multiple 
dependent variables, the joint likelihood is calculated via Pr(X;, ..., Xx) = Pr(X1, -00 Xn) PriXy|X1, -o Xna) 
as per the chain rule. We also recall Equation (9) for the SPN conditional likelihood and apply it WMISPNs 
where we see the likelihood over dependent features presented as: 


f(X;) f(X5,X;) fy (Xp Xna Xo) 


Pr(X,,<«+, Xv) = sis (22) 

YOON Z AX) AU ee X) 
This can be reduced to f,(X;, ..., X,)/Z. Effectively, we are able to represent the joint density function 
P'-N(X,, ..., Xn) with a WMISPN model over the variables, and perform inference on the continuous features 


using the leaf nodes to integrate over the ranges: 
b. b 
f al NaN (X20 Xu dx, e dxy (23) 
a AN 


When integrating the model into the SPN some SPN properties need to be considered. The root node 
will always calculate the likelihood of a query by taking the product of its children. Given the mutual 
exclusivity of a continuous feature’s multiple ranges, two or more passes are required for an SPN to 
calculate the likelihood of a query atom over multiple intervals for the same feature. 


For any univariate query q(x) of the form (a < x < b), we define the query length to be the number of 
propositions (xı, ..., x,) such that g(x) © x # {}. Here x? is the refinement of the boolean activation which 
retrieves the interval it corresponds to. As an example, suppose leaves in the WMISPN mention « < x < f 
and ß < x < y. Then a query of length 1 is of the form «* < x < B*, where a < «* and f* < ß. A query of 
length 2 is of the form «* < x < y*, where « < «* as before but f < y* < y. In essence, complex querying 
finds the number of intervals in the WMISPN that q(x) intersects with. 


Proposition 2 Suppose T is a learned WMISPN and q € PP(T). The probability of the univariate interval 
query q of length k requires k passes through the WMISPN. 


If q e L(S PN) then query length is 1, we then immediately get Proposition 1 as a special case. For 
simplicity, we assume that queries are limited to the following syntax: 


1). Every query atom is either an interval x €e [a, b] or a discrete feature A € [a,, ..., anl- 
2). A query consists of a number of conjunctions. 
3). Each conjunction has to be distinct, e.g. no conjunction shares the same query atoms. 


Here, (2) and (3) are not fundamental restrictions: it is easy to extend (2) by applying the rule Pr(A v B) 
= Pr(A) + Pr(B) — Pr(A ^ B), and in the case of (3), linear arithmetic solvers can be used to reason about 
multiple constraints for the same variable: for example, Pr(30 < weight(X)) A Pr(50 < weight(X) < 60) can 
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be resolved to Pr(30 < weight(X) < 60). So, (1) says computing the probability of expressions such as x > y 
and x + y > 2z, where x, y, z are continuous variables, cardinality queries [8], among others are not dealt 
with currently. 


We remark that in [30], the tractable inference advantages of MSPNs were utilized for methods akin to 
our complex querying. The proposed Model-based Approximate Query Processing (AQP) takes advantage 
of MSPN inference, which while similar to our method, is specifically concerned with SQL style queries 
and speed of inference. As the queries posed are SQL in nature (user-defined functions, arithmetic 
expressions, etc.) the scope of inference queries is larger than that found with complex queries as defined 
here. [30] utilizes both probabilistic and sampling approaches for inference, specifically with sampling 
serving to aid in approximating complex aggregation queries. Our account is focused on exact inference 
currently. 


There are some shared techniques, such as readjusting of leaf weights for inference. [30] uses a sampling 
method with conditioned constraints, while our method uses the interval range on a given continuous 
feature to determine the weight adjustment at the leaf node. As mentioned, our approach is exact, does 
not use sampling methods, and as we are not investigating increasing inference speeds we find pruning of 
our WMISPN model unnecessary whereas Kulessa et al. actively do this. As [30] uses MSPNs, they calculate 
continuous variables with linear interpolations of histograms, as such their inequality queries are limited 
to linear approximations on continuous intervals (see Section 2) where they are treated as constants 
preventing user-specified inequality type queries (see Equation (6)). With our framework, we find our 
piecewise polynomial approximations allow for user-defined complex querying. On the other hand, the 
SQL mechanism has advantages such as biased sampling on rare sub-populations for ad-hoc querying as 
well as providing general SQL aggregation queries such as count, average, and sum. As such, we find it a 
useful research direction to combine complex querying with the SQL style queries with aggregation. 


6. EXPERIMENTAL EVALUATION 


The goal of this section is to evaluate the merits of LearnWMISPN and the complex query interface. 
Specifically, we attempt to address the following questions: 


Q1 How effective (in terms of log-likelihood) is LearnWMISPN on complex mixed discrete-continuous 
data? 

Q2 How reasonable is the learned distribution: that is, what is the order of the learned polynomials, 
and what is the spread of the probabilities for the underlying intervals? 

Q3 How effective is the query interface in the presence of increasing query lengths? 


We measured the performance of the learned SPNs on preprocessed hybrid domain data sets discussed 
in the independent effort [28] that introduced Mixed-Sum Product Networks (MSPNs), Automatic Bayesian 
Density Analysis (ABDA) [31], and Bayesian SPNs (BSPNs) [37], as discussed in Section 2. 
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Our performance evaluation of SPNs in conjunction with WMI was performed using three different 
regimes. We measured the performance of the learned SPNs from LearnWMISPN on preprocessed hybrid 
domain data sets. Doing so allowed us to directly compare the performance of LearnWMISPN to learn [28, 
31, 37], and test whether our approach with continuous distributions was effective. We then investigated 
whether further increasing the number of leaves per continuous feature resulted in improvements in the 
model. Next, measuring the complex querying capacity of WMI with SPNs was performed by recording 
the computation times dependent on query length. We then demonstrate complex querying accuracy by 
measuring error rates using synthetic data. As we mentioned before, WMI is a very flexible framework for 
inference in hybrid domains, and as our empirical results show, our integration with SPNs has yielded a 
scalable unsupervised learning architecture that is able to handle benchmarks and many (preprocessed) 
UCI data sets involving hundreds of variables. 


Q1 Hybrid Benchmarks: We evaluated LearnWMISPN on 11 hybrid domain data sets taken from the 
UCI machine learning repository [51]. Each data set was composed of differing proportions of categorical, 
discrete, and continuous features. The diverse set of domains comprised data from financial, medical, 
automotive and other sectors. The hybrid data sets were selected based on previous research in learning 
on hybrid domains [28, 31, 37]. As we wanted the capacity to make direct comparisons to previous 
research, we primarily focused on the data sets used in other works. The most important factor was that 
the data sets remain mixed with discrete and continuous features to take advantage of our methods 
integration of piecewise polynomial approximations. MSPNs, ABDA, and BSPNs were used as a comparative 
baselines in our evaluation. MSPNs proved to be a successful mixed probabilistic model [28], and given 
the similar nature of our objectives, a comparison was appropriate. ABDA and BSPNs relied on MSPNs as 
a comparative baseline in their respective works, thus they both were included as comparative models as 
well. Structure learning with LearnMSPN was performed with different variants. MSPNs were trained using 
the Gower distance, which is a metric over hybrid domains, with Gaussian distributional assumptions for 
continuous variables. MSPNs were also trained using the randomized dependency coefficient (RDC) which 
does not make parametric assumptions. For each model, histogram representations were compared against 
isometric regression models. 


The dimensions of the data sets can be seen in Table 1. Given the original data sets used real values, 
our preprocessing results in an augmented binary matrix that has a larger variable count. In the original 
work, more data sets were used to measure performance, but in our case, we omitted data sets which did 
not contain continuous features. We used the original LearnSPN training, validation, and test ratio splits, 
which were 75%, 10%, and 15%, respectively. 
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Table 1. Data set statistics. 


Data set |V| Iv] Train Valid Test 
anneal-U 38 95 673 90 134 
australian 15 50 517 69 103 
auto 26 85 119 16 23 
car 9 50 294 39 58 
cleave 14 35 222 29 44 
crx 15 54 488 65 97 
diabetes 9 33 576 76 115 
german 21 76 750 99 150 
german-org 25 70 750 99 150 
heart 14 35 202 27 40 
iris 5 11 112 15 22 


The log-likelihood results for Learn methods and LearnWMISPN are shown in Table 2. As can be seen, 
our model outperforms other models by a wide margin. The performance of our model results in significantly 
better likelihoods across the majority of data sets. In most cases, our implementation only required two 
bins per continuous feature, and one instance (the iris data set) required a three bin split in order to produce 
a likelihood competitive to the models. 


Table 2. Average test set log likelihoods for structured learning methods on hybrid data sets. 


WMI-SPN Gower- RDC- ABDA BSPN 
Data set 
LearnWMISPN (bin size) hist iso hist iso - - 

anneal-U -14.54 (2) -63.55 -38.83 -60.31 -38.31 -2.65 -16.44 
australian -10.47 (2) -18.51 -30.37 17.89 -31.02 -17.70 -21.51 
auto -27.12 (2) -72.99 -69.40 -73.37 -70.06 -70.62 -58.45 
car -9.11 (2) -30.46 -31.08 -29.13 -30.51 -35.04 -28.73 
cleave -13.82 (2) -26.13 -25.86 -29.13 -25.44 -22.60 -22.95 
crx -10.52 (2) -22.42 -31.62 -24.03 -31.72 -15.53 -11.54 
diabetes -6.29 (2) -15.28 -26.96 -15.93 -27.24 -17.48 -21.21 
german -20.42 (2) -40.82 -26.85 -38.82 -32.36 -32.10 -26.76 
german-org -21.14 (2) -43.61 -26.85 -37.45 -27.29 -26.29 -21.32 
heart -14.87 (2) -20.69 -26.99 -20.37 -25.90 -23.39 -22.93 
iris -3.56 (3) -3.601 -2.89 -3.44 -2.84 -2.96 -3.54 


Note: The bin size is given for the WMISPN log likelihoods. 


The effectiveness of the framework can be attributed to a number of factors. We observed that unsupervised 
binning interfaced with LearnSPN’s existing structure learning performed well on the hybrid data sets. This 
can be thought of as piecewise constant WMISPNs. The approach is simple and thus, fast, adding negligible 
complexity to the original LearnSPN machinery: the effectiveness in learning binary representations and 
the simple counting of Boolean variable activations is inherited from LearnSPN. In addition to that, we 
support the learning of polynomial weights, and not just constant or linear ones. In particular, our use of 
the BIC criteria to choose the most optimal representation, which has a bias towards smaller bin numbers 
and polynomial exponents. 
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Q2 Model Complexity: When investigating the correlation of model accuracy to model complexity, we 
used data sets with the majority variables in the continuous domain. We also used data sets with significantly 
more instances to learn on, ranges being from 1,024 to 17,389 in order to demonstrate the scalability of 
WMISPNs. The data sets investigated were the Cloud data set, Statlog (Shuttle) data set, and a subset of the 
MiniBooNE particle identification data set, which were all from the UCI machine learning repository [51]. 
Our inquires below use equal width binning as a discretisation method for continuous features. 


The most immediate question is what is the nature of learned polynomials. The BIC measure, as we 
mentioned, is used to determine the number of bins and polynomial order. So, by relaxing the criteria for 
the number of bins to be used, we can study the polynomials. In Table 3, we see statistics on the order of 
the learned polynomials for 2 vs 5 bins. The order never goes beyond 6, and in a majority of the cases is 
<4, confirming once again that the learned orders stay manageable. Increasing the bin size favours low- 
order polynomials: only Australia and Cloud have order 6 polynomials with 5 bins. 


Table 3. Comparison of the distributions of orders (in %) for 2 and 5 bins. 


Data set Bins 2nd-Order 3rd-Order Ath-Order 5th-Order 6th-Order 
Australia 2 0 16.667 16.667 33.3 33.3 
Australia 5 16.667 33.333 16.667 16.667 16.667 
Auto 2 15.8 36.842 31.579 12.789 0 
Auto 5 73.684 15.789 10.526 0 0) 
German-org 2 0 33.333 0 0 66.666 
German-org 5 33.333 33.333 0 33.333 0 
Heart 2 0 20 20 60 0) 
Heart 5 0 60 40 0 0) 
lris 2 50 0 50 0 0 
Iris 5 75 25 0) 0 0) 
Statlog 2 42.857 0 (0) 0 57.143 
Statlog 5 28.571 57.143 0 14.289 (0) 
Cloud 2 10 10 40 10 30 
Cloud 5 30 50 10 0 10 


In Figure 4, It is evident that increased binning results in piecewise polynomial functions that can better 
approximate the spread of data. In the case of the diabetes data set, at 5 bins the learned polynomial 
function was better able to capture the distribution of continuous feature points compared with the 2 bin 
approximation. The improved polynomial function approximation also lends itself to more accurate bin 
probabilities for SPN inference calculations. Also of note is the polynomial order for the diabetes continuous 
feature (Plasma glucose concentration) at 2 bins was a 4" order polynomial, while at 5 bins the polynomial 
order was only 2. 
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Figure 4. Comparison of learned piecewise polynomial functions for feature 2 (Plasma glucose concentration) of 
the diabetes data set. Note: Bin intervals are represented by alternating colors. 


The second question, then, is how are the probabilities spread from the learned representation? (That is, 
assume we learn p,(x) for 0 < x < 5 and p,(x) for 5 < x < 10; we consider the probabilities on computing 
the volumes and normalising.) Naturally, this spread depends very much on the data, but by considering 
multiple data sets, one can empirically study the effectiveness of the learning regime. We see in Figure 6 
that the representations match the characteristics of the data (e.g., sparseness for some attributes, missing 
values), diverse as they are. 


The third natural question is whether learning polynomials are beneficial at all? We mentioned earlier 
that unsupervised binning along with a simple binarisation scheme can be seen as a simplistic hybrid 
model — an instance of piecewise constant WMISPNs — for which LearnSPN suffices. Clearly, such an 
endeavour would come at a significant loss of expressiveness, e.g., no interval querying. So, by letting BIC 
determine the polynomial order but explicitly setting the bin parameter, we can contrast LearnSPN and 
LearnWMISPN. It should be noted that as each bin range corresponds to a distribution represented by 
a given polynomial, further bin increases on a feature result in differing polynomial representations. 
(Analogously, classical SPNs will redistribute the spread of discrete probabilities.) In Figure 5, the plotted 
performance of average log-likelihoods over bin complexity is normalized by the number of new variables 
generated from equal width binning. We see that accuracy on the data set does increase as the number of 
bins per feature increases, and thus LearnWMISPN yields a more accurate representation. 
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Figure 5. Instances of expanded Cloud data sets and resulting normalized log-likelihoods. 


Comparisons done with LearnWMISPN (red) and LearnSPN (green) [10]. It should be noted that 
LearnWMISPN is yielding a highly granular and intricate representation. For example, at 5 bins per 
feature, the first bin of the first feature (the pixel mean) defined over a range of (3.0 < x < 20.1) is given 
a density approximation that is a 3" degree polynomial, and then at 15 bins per feature, over a range of 
(3.0 < x < 8.7) the density approximation is now a 4" degree polynomial. 
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Figure 6. The spread of probabilities, with 2 bins per attribute. 
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Q3 Query Interface: The capacity to query SPNs trained on continuous domains represents a significant 
improvement in terms of interpreting data. With regard to posing complex queries to WMISPNs, we studied 
the computation time for inferences. The three data sets used for measuring this are the Cloud, Statlog 
(Shuttle), and MiniBooNE data sets. All data sets are comprised of continuous features, with Statlog 
containing a single discrete feature. In order to measure complexity, we refer to query length (see Section 
5) and refer to continuous queries as c' and discrete as d! with / representing the query count. We generated 
10 random queries based on the continuous ranges for each data set, including discrete outcomes for 
Statlog, and averaged the inference time. We also fixed the number of bins per continuous feature to two. 


In Table 4, we see the inference speeds on the complex queries. Overall, it is clear that complex queries 
do not slow down the model. For all query lengths, the time per query remains in the nanosecond range. 
As query length increases, more time is required but overall the increase in query time is linear. The model 
was also capable of handling queries with mixed continuous and discrete atoms, further demonstrating the 
capability of WMISPNs. 


Table 4. Average query time for differing query lengths. 


qı q2 q3 q4 qs 
Data set c! c Ç! d! c Cc d! ct en d! © Er d! 
Cloud 1401 2154 n/a 2563 n/a 3659 n/a 4025 n/a 
Statlog 1299 n/a 1878 n/a 2365 n/a 2749 n/a 2982 
MiniBooNE 1168 2071 n/a 2219 n/a 2290 n/a 3048 n/a 


Note: Time per query is measured in (ns/query). 


Finally, in order to measure the accuracy of the learned models, we calculated the inference likelihood 
mean square error (MSE) for three synthetic data sets. The data sets comprised a 1000-instance sampled 
from a Gaussian N(0.7, 02), an exponential (2 = 1), and a distribution mixture (0.5, 0.1) and Beta 
Be(2, 20). 150 complex queries were generated each of length 1. From our results in Figure 7, we observe 
the overall accuracy of the model in handling complex queries. The MSE also tends to approach lower 
values as the bin count increases, with the exception of the exponential distribution where the MSE was 
lowest with 3 bins but generally the rate remained steady. The learned WMISPN model also demonstrates 
robustness to handling inference on mixtures of distributions, again, seen by the decreasing error rate. 
Effectively our model is capable of performing both tractable inference, and returning accurate likelihoods 
for complex queries on various distributions. 
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Figure 7. The inference likelihood mean square error for sampled data taken from Gaussian, Exponential, and 
mixture distributions of Gaussian and Beta. 


It is worth mentioning again the comparisons here to MSPNs [28] and other works [30,31]. Regarding 
MSPNs, the differences in the general framework include our use of the G-test versus the use of the Renyi 
decomposition. As we abstract continuous features into discretized bins, it is sufficient to rely on the G-test. 
We also reiterate our use of piecewise polynomials over piecewise linear constants which allows our model 
to perform comparisons. We also point out inference is mixed concerning MSPNs, while our model is WMI 
based, and by doing so we can perform complex querying with respect to arbitrary intervals. Contrast that 
with the MSPN SQL approach [30] which uses the piecewise linear constants for inference on continuous 
random variables and is generally dissimilar to our methods (see Section 5) as SQL queries are more 
expansive. [31] also explores inference with MSPNs and replaces the piecewise linear approximations 
mapped to leaf nodes with a likelihood dictionary where parameters are learned via Gibbs sampling. Again, 
our model is agnostic to the distribution of the data and learns piecewise polynomial approximation which 
augments our inference mechanism with complex querying. Future research would benefit from the 
integration of all these methods, and we also find we are in a good position to go beyond intervals and 
consider oblique supports (see Section 7). 


7. CONCLUSION 


Deep architectures are powerful learning paradigms that capture the latent structure and have proven to 
be very successful in machine learning. Guessing the right architecture for complex data is challenging, 
and so paradigms such as SPNs are attractive alternatives in providing a unsupervised learning regime in 
addition to robust inference computations. 
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Here, we pushed the envelope further to consider a systematic integration of SPNs and WMI, allowing 
us to learn tractable, non-parametric distributions and convex combinations thereof for hybrid data, also 
in a unsupervised fashion. The integration was achieved by minimally adapting the base case for an SPN 
structure learning module, which makes our approach generic to a large extent. Different from our 
predecessors, we show for the first time how tractable distributions of arbitrary granularity can be learned, 
and more importantly, how to query these distributions over a rich interval syntax. Our empirical results 
show that our implemented system is effective, scalable and incurs a very little cost for handling continuous 
features, all of which are very desirable for learning from big uncertain data. One interesting direction is 
to extend the underlying constraint language to handle oblique supports (e.g. x > y and x + y > 2) rather 
than only intervals. A second challenge for the future includes extending the query language with features 
like counting operators, which would allow us to reason about the cardinality of sets of objects in an image, 
thus enabling an interface for commonsensical reasoning within deep architectures. 
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